Skip to content

PYTHON-2433 Fix Python 3 ServerDescription/Exception memory leak #520

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Nov 21, 2020

Conversation

ShaneHarvey
Copy link
Member

When the SDAM monitor check fails, a ServerDescription is created from
the exception. This exception is kept alive via the
ServerDescription.error field. Unfortunately, the exception's traceback
contains a reference to the previous ServerDescription. Altogether this
means that each consecutively failing check leaks memory by building an
ever growing chain of ServerDescription -> Exception -> Traceback ->
Frame -> ServerDescription -> ... objects.

This change breaks the chain and prevents the memory leak by clearing
the Exception's traceback, context, and cause fields.

When the SDAM monitor check fails, a ServerDescription is created from
the exception. This exception is kept alive via the
ServerDescription.error field. Unfortunately, the exception's traceback
contains a reference to the previous ServerDescription. Altogether this
means that each consecutively failing check leaks memory by building an
ever growing chain of ServerDescription -> Exception -> Traceback ->
Frame -> ServerDescription -> ... objects.

This change breaks the chain and prevents the memory leak by clearing
the Exception's __traceback__, __context__, and __cause__ fields.
@ShaneHarvey
Copy link
Member Author

Rebased to fix test failures caused by PYTHON-2436.

# If a bug like PYTHON-2433 is reintroduced then too many
# ServerDescriptions will be kept alive and this test will fail:
# AssertionError: 4 != 22 within 5 delta (18 difference)
self.assertAlmostEqual(initial_count, final_count, delta=5)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice test. What is behind the choice of using delta=5? Is that number 5 liable to change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 is arbitrary but it allows for some tolerance for race conditions, like having several clients open in the background (creating and destroying ServerDescriptions).

@@ -69,6 +70,11 @@ def __init__(
self._me = ismaster.me
self._last_update_time = _time()
self._error = error
# PYTHON-2433 Clear error traceback info.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason you are purging the traceback, etc. in this location as opposed to after emitting the heartbeat failed event in Monitor._check_server? Why preclude any exception on a ServerDescription from having a traceback?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it is a bit odd to mutate the exception in the ServerDescription constructor. I moved this logic to the Monitor.

Copy link
Contributor

@prashantmital prashantmital left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome job identifying this so quickly. It would have been a real doozy to figure out independently!

@ShaneHarvey
Copy link
Member Author

Alternative fixes would be:

  • refactor Monitor's code such that exceptions won't hold references to previous ServerDescriptions. I rejected this approach because it seems too brittle.
  • Use traceback.clear_frames() to clear the references in the traceback. For some reason this doesn't fix the bug.
  • Change ServerDescription.error to be a string representation of the error (repr(error)). This is probably the most robust approach but it's a backwards breaking change and it breaks a few tests.

@ShaneHarvey ShaneHarvey merged commit 6c92e6c into mongodb:master Nov 21, 2020
@ShaneHarvey ShaneHarvey deleted the PYTHON-2433 branch November 21, 2020 02:59
ShaneHarvey added a commit that referenced this pull request Nov 21, 2020
When the SDAM monitor check fails, a ServerDescription is created from
the exception. This exception is kept alive via the
ServerDescription.error field. Unfortunately, the exception's traceback
contains a reference to the previous ServerDescription. Altogether this
means that each consecutively failing check leaks memory by building an
ever growing chain of ServerDescription -> Exception -> Traceback ->
Frame -> ServerDescription -> ... objects.

This change breaks the chain and prevents the memory leak by clearing
the Exception's __traceback__, __context__, and __cause__ fields.

(cherry picked from commit 6c92e6c)
ShaneHarvey added a commit that referenced this pull request Nov 21, 2020
When the SDAM monitor check fails, a ServerDescription is created from
the exception. This exception is kept alive via the
ServerDescription.error field. Unfortunately, the exception's traceback
contains a reference to the previous ServerDescription. Altogether this
means that each consecutively failing check leaks memory by building an
ever growing chain of ServerDescription -> Exception -> Traceback ->
Frame -> ServerDescription -> ... objects.

This change breaks the chain and prevents the memory leak by clearing
the Exception's __traceback__, __context__, and __cause__ fields.

(cherry picked from commit 6c92e6c)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants